Let us consider the dataframe mtcars, which comprises the fuel consumption and 10 aspects of design and performance for 32 automobiles (1970s models). The help file is given below
| mtcars | R Documentation |
A data frame with 32 observations on 11 (numeric) variables.
| [, 1] | mpg | Miles/(US) gallon |
| [, 2] | cyl | Number of cylinders |
| [, 3] | disp | Displacement (cu.in.) |
| [, 4] | hp | Gross horsepower |
| [, 5] | drat | Rear axle ratio |
| [, 6] | wt | Weight (1000 lbs) |
| [, 7] | qsec | 1/4 mile time |
| [, 8] | vs | Engine (0 = V-shaped, 1 = straight) |
| [, 9] | am | Transmission (0 = automatic, 1 = manual) |
| [,10] | gear | Number of forward gears |
| [,11] | carb | Number of carburetors |
Describe how to perform a preliminary data analysis on this dataframe, using suitable R commands.
We want to build a statistical regression model to analyze the average fuel consumption of the car models in the mtcars dataframe.
The response variable is mpg, the (average) number of US miles travelled per gallon of fuel.
Let us examine the variables:
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Some variables need to be converted to more appropriate types:
mtcars$cyl <- as.integer(mtcars$cyl)
mtcars$hp <- as.integer(mtcars$hp)
mtcars$vs <- factor(mtcars$vs)
levels(mtcars$vs) <- c("V-shaped", "straight")
mtcars$am <- factor(mtcars$am)
levels(mtcars$am) <- c("automatic", "manual")
mtcars$gear <- as.integer(mtcars$gear)
mtcars$carb <- as.integer(mtcars$carb)
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : int 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : int 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : Factor w/ 2 levels "V-shaped","straight": 1 1 2 2 1 2 1 2 2 2 ...
## $ am : Factor w/ 2 levels "automatic","manual": 2 2 2 1 1 1 1 1 1 1 ...
## $ gear: int 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: int 4 4 1 1 2 1 4 2 2 4 ...
summary(mtcars)
| mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|
| Min. :10.40 | Min. :4.000 | Min. : 71.1 | Min. : 52.0 | Min. :2.760 | Min. :1.513 | Min. :14.50 | V-shaped:18 | automatic:19 | Min. :3.000 | Min. :1.000 |
| 1st Qu.:15.43 | 1st Qu.:4.000 | 1st Qu.:120.8 | 1st Qu.: 96.5 | 1st Qu.:3.080 | 1st Qu.:2.581 | 1st Qu.:16.89 | straight:14 | manual :13 | 1st Qu.:3.000 | 1st Qu.:2.000 |
| Median :19.20 | Median :6.000 | Median :196.3 | Median :123.0 | Median :3.695 | Median :3.325 | Median :17.71 |  |  | Median :4.000 | Median :2.000 |
| Mean :20.09 | Mean :6.188 | Mean :230.7 | Mean :146.7 | Mean :3.597 | Mean :3.217 | Mean :17.85 |  |  | Mean :3.688 | Mean :2.812 |
| 3rd Qu.:22.80 | 3rd Qu.:8.000 | 3rd Qu.:326.0 | 3rd Qu.:180.0 | 3rd Qu.:3.920 | 3rd Qu.:3.610 | 3rd Qu.:18.90 |  |  | 3rd Qu.:4.000 | 3rd Qu.:4.000 |
| Max. :33.90 | Max. :8.000 | Max. :472.0 | Max. :335.0 | Max. :4.930 | Max. :5.424 | Max. :22.90 |  |  | Max. :5.000 | Max. :8.000 |
library(moments)
hist(mtcars$mpg, probability = TRUE, breaks = 10)
lines(density(mtcars$mpg), lwd = 3)
x <- seq(0, 50, 0.01)
lines(x, dnorm(x, mean = mean(mtcars$mpg), sd = sd(mtcars$mpg)), col = "cyan")
abline(v = mean(mtcars$mpg), col = "red")
abline(v = median(mtcars$mpg), col = "green")
skewness(mtcars$mpg)
## [1] 0.6404399
kurtosis(mtcars$mpg)
## [1] 2.799467
The histogram is slightly skewed relative to the normal density; let us try a transformation:
logmpg <- log(mtcars$mpg)
hist(logmpg, probability = TRUE, breaks = 10)
lines(density(logmpg), lwd = 3)
x <- seq(1.5, 4, 0.01)  # grid on the log scale
lines(x, dnorm(x, mean = mean(logmpg), sd = sd(logmpg)), col = "cyan")
abline(v = mean(logmpg), col = "red")
abline(v = median(logmpg), col = "green")
skewness(logmpg)
## [1] -0.02241292
kurtosis(logmpg)
## [1] 2.64908
The log transformation seems to improve the shape of the mpg density: it makes the distribution markedly more symmetric (skewness drops from 0.64 to about -0.02), at the cost of a slight decrease in the kurtosis index.
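A formal check can complement the graphical comparison; a minimal sketch using the Shapiro-Wilk test from base R (the skewness and kurtosis indices above remain the primary descriptive tools):

```r
# Shapiro-Wilk normality test on the original and log-transformed response:
# a larger p-value means weaker evidence against normality
shapiro.test(mtcars$mpg)
shapiro.test(log(mtcars$mpg))
```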
mtcars$logmpg <- logmpg
#par(mfrow=c(2, 5))
barplot(table(mtcars[, "cyl"]), xlab = "Number of cylinders")
hist(mtcars$disp, breaks = 10)
hist(mtcars$hp, breaks = 10)
hist(mtcars$drat, breaks = 10)
hist(mtcars$wt, breaks = 10)
hist(mtcars$qsec, breaks = 10)
barplot(table(mtcars[, "vs"]), xlab = "vs")
barplot(table(mtcars[, "am"]), xlab = "am")
barplot(table(mtcars[, "gear"]), xlab = "gear")
barplot(table(mtcars[, "carb"]), xlab = "carb")
#par(mfrow=c(1,1))
for (i in c(2, 8, 9, 10)) {
boxplot(mtcars$logmpg~mtcars[, i], xlab = names(mtcars)[i])
}
for (i in c(3, 4, 5, 6, 7, 11)) {
plot(mtcars$logmpg~mtcars[, i], xlab = names(mtcars)[i])
lines(lowess(mtcars$logmpg~mtcars[, i]))
}
There seems to be a strong linear correlation between log(mpg) and cyl, vs, am and wt. The variables hp, disp and drat also appear related to log(mpg), but not linearly.
Let us check these correlations with the Pearson and Spearman indices:
df <- with(data = mtcars, matrix(
c(cyl, wt, disp, hp),
byrow = FALSE,
nrow = length(mtcars$logmpg)
))
colnames(df) <- c("cyl", "wt", "disp", "hp")
cor(mtcars$logmpg, df, method = "pearson")
| cyl | wt | disp | hp |
|---|---|---|---|
| -0.8546921 | -0.8930611 | -0.8804044 | -0.7894727 |
cor(mtcars$logmpg, df, method = "spearman")
| cyl | wt | disp | hp |
|---|---|---|---|
| -0.9108013 | -0.886422 | -0.9088824 | -0.8946646 |
# does it make sense to compute correlations with categorical variables?
#cor(mtcars$logmpg, as.numeric(mtcars$vs), method = "pearson")
#cor(mtcars$logmpg, as.numeric(mtcars$vs), method = "spearman")
#cor(mtcars$logmpg, as.numeric(mtcars$am), method = "pearson")
#cor(mtcars$logmpg, as.numeric(mtcars$am), method = "spearman")
The correlation indices confirm some impressions and revise others:
the strong (negative) linear correlation is confirmed for cyl and wt;
for hp and disp, the gap between the Pearson and Spearman indices suggests an association with logmpg that is monotone but not linear (plausibly an inverse proportionality);
for drat, the comparison between the two indices instead suggests a linear correlation, in contrast with the earlier impression; this correlation is weak, however, and will probably have to be ignored when building the model.
Let us try transforming hp and disp:
plot(mtcars$logmpg~I(1/mtcars$hp))
lines(lowess(mtcars$logmpg~I(1/mtcars$hp)))
plot(mtcars$logmpg~I(1/mtcars$disp))
lines(lowess(mtcars$logmpg~I(1/mtcars$disp)))
df <- with(data = mtcars, matrix(
c(1/disp, 1/hp),
byrow = FALSE,
nrow = length(mtcars$logmpg)
))
colnames(df) <- c("1/disp", "1/hp")
cor(mtcars$logmpg, df, method = "pearson")
| 1/disp | 1/hp |
|---|---|
| 0.8913563 | 0.8366077 |
The plots and the Pearson indices confirm that the two transformations are effective.
After fitting the model fit <- lm(mpg ∼ disp + hp + wt + drat, data=mtcars), the following outputs are obtained by the R commands summary(fit) and plot(fit), respectively. Describe how to interpret these results, and then suggest how to proceed with further analyses.
##
## Call:
## lm(formula = mpg ~ disp + hp + wt + drat, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5077 -1.9052 -0.5057 0.9821 5.6883
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.148738 6.293588 4.631 8.2e-05 ***
## disp 0.003815 0.010805 0.353 0.72675
## hp -0.034784 0.011597 -2.999 0.00576 **
## wt -3.479668 1.078371 -3.227 0.00327 **
## drat 1.768049 1.319779 1.340 0.19153
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.602 on 27 degrees of freedom
## Multiple R-squared: 0.8376, Adjusted R-squared: 0.8136
## F-statistic: 34.82 on 4 and 27 DF, p-value: 2.704e-10
plot(fit, which = c(1:6), lwd=2)
From the summary we note the large intercept and the high p-values, especially for disp and drat, suggesting that these regressors have no significant effect on the response.
The diagnostic plots indeed show a non-constant pattern in the residuals, and both the Q-Q plot and the standardized-residual plot confirm the heavy tails already detected during the exploratory analysis: the residuals tend to grow at the extreme fitted values.
This impression is also supported by the leverage and Cook's distance plots, with three observations (Toyota Corolla, Chrysler Imperial, Maserati Bora) flagged in practically every plot, confirming their influence on the fitted values of mpg.
Let us propose an alternative model based on the observations made during the EDA:
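To back up the graphical impression numerically, the influence measures can be extracted directly; a sketch, refitting the same model as in the question:

```r
# refit the model from the question and extract influence diagnostics
fit <- lm(mpg ~ disp + hp + wt + drat, data = mtcars)
cd <- cooks.distance(fit)
sort(cd, decreasing = TRUE)[1:3]  # the three most influential cars
hat <- hatvalues(fit)
names(hat)[hat > 2 * length(coef(fit)) / nrow(mtcars)]  # high-leverage points
```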
mtcars.fit <- lm(logmpg ~ cyl + wt + I(1/disp) + I(1/hp), data = mtcars)
summary(mtcars.fit)
##
## Call:
## lm(formula = logmpg ~ cyl + wt + I(1/disp) + I(1/hp), data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.15842 -0.08823 -0.02354 0.08879 0.21965
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.35417 0.26674 12.575 8.44e-13 ***
## cyl -0.02094 0.02526 -0.829 0.414396
## wt -0.15294 0.03753 -4.076 0.000362 ***
## I(1/disp) 9.87334 15.63627 0.631 0.533064
## I(1/hp) 19.59982 10.17911 1.925 0.064760 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1065 on 27 degrees of freedom
## Multiple R-squared: 0.8885, Adjusted R-squared: 0.872
## F-statistic: 53.79 on 4 and 27 DF, p-value: 1.785e-12
plot(mtcars.fit, which = 1:6)
This model appears decidedly better than the previous one (much lower AIC, higher \(R^2\) indices), although AIC values are not directly comparable between models fitted to mpg and to log(mpg), since the response is on a different scale. On the other hand, both the Q-Q plot and the standardized residuals look somewhat worse and, despite a very small p-value for the overall F test, the individual p-values of the regressors are large: a symptom of collinearity not detected before.
library(DAAG)
## Loading required package: lattice
vif(mtcars.fit)
## cyl wt I(1/disp) I(1/hp)
## 5.5598 3.6819 8.5939 4.7799
The VIF analysis confirms the presence of collinearity in the model.
mtcars.fit2 <- lm(logmpg ~ wt + I(1/hp), data = mtcars)
summary(mtcars.fit2)
##
## Call:
## lm(formula = logmpg ~ wt + I(1/hp), data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.156884 -0.083000 -0.008951 0.083485 0.242816
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.29685 0.13518 24.389 < 2e-16 ***
## wt -0.18351 0.02757 -6.656 2.68e-07 ***
## I(1/hp) 29.68441 6.56369 4.523 9.54e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1061 on 29 degrees of freedom
## Multiple R-squared: 0.8813, Adjusted R-squared: 0.8731
## F-statistic: 107.6 on 2 and 29 DF, p-value: 3.805e-14
plot(mtcars.fit2, which = 1:6)
vif(mtcars.fit2)
## wt I(1/hp)
## 2.0048 2.0048
AIC(fit, mtcars.fit, mtcars.fit2)
| df | AIC | |
|---|---|---|
| fit | 6 | 158.58366 |
| mtcars.fit | 6 | -45.93601 |
| mtcars.fit2 | 4 | -47.92801 |
anova(mtcars.fit2, mtcars.fit)
| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 29 | 0.3263152 | NA | NA | NA | NA |
| 27 | 0.3064681 | 2 | 0.0198471 | 0.8742682 | 0.4286456 |
The reduced model appears preferable, as confirmed by the VIF analysis, by the individual and overall p-values, by the ANOVA comparison (dropping cyl and 1/disp does not significantly increase the residual sum of squares, p = 0.43) and by the AIC statistic.
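Since the selected model is fitted on the log scale, predictions must be back-transformed; a minimal sketch (the wt and hp values below are hypothetical):

```r
# refit the selected model and predict for a hypothetical car
mtcars.fit2 <- lm(log(mpg) ~ wt + I(1/hp), data = mtcars)
new <- data.frame(wt = 3, hp = 150)
pred <- predict(mtcars.fit2, newdata = new, interval = "prediction")
exp(pred)  # point prediction and prediction interval on the mpg scale
```

Note that exp() of the fitted value estimates the conditional median of mpg, not the mean; under normal errors the mean would require a correction such as exp(fit + s^2/2).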
Let us consider the dataframe bp.obese of the library ISwR, which comprises information about sex, obesity and blood pressure for a random sample of 102 Mexican-American adults in a small California town. The help file and the output of the str command are given below
##
## Attaching package: 'ISwR'
## The following object is masked from 'package:DAAG':
##
## lung
| bp.obese | R Documentation |
This data frame contains the following columns:
sex
a numeric vector code, 0: male, 1: female.
obese
a numeric vector, ratio of actual weight to ideal weight from New York Metropolitan Life Tables.
bp
a numeric vector,systolic blood pressure (mm Hg).
| sex | obese | bp |
|---|---|---|
| Min. :0.0000 | Min. :0.810 | Min. : 94.0 |
| 1st Qu.:0.0000 | 1st Qu.:1.143 | 1st Qu.:116.0 |
| Median :1.0000 | Median :1.285 | Median :124.0 |
| Mean :0.5686 | Mean :1.313 | Mean :127.0 |
| 3rd Qu.:1.0000 | 3rd Qu.:1.430 | 3rd Qu.:137.5 |
| Max. :1.0000 | Max. :2.390 | Max. :208.0 |
The aim of the study is to analyze the potential relationship between blood pressure, which is the response variable, and obesity, taking into account also the factor regressor sex. Describe how to perform a preliminary data analysis on this dataframe, using suitable R commands and comment the following plot.
### Preliminary analysis
Let us start with a quick overview of the dataset.
help(bp.obese)
| bp.obese | R Documentation |
The bp.obese data frame has 102 rows and 3 columns. It contains data from a random sample of Mexican-American adults in a small California town.
bp.obese
This data frame contains the following columns:
sex
a numeric vector code, 0: male, 1: female.
obese
a numeric vector, ratio of actual weight to ideal weight from New York Metropolitan Life Tables.
bp
a numeric vector,systolic blood pressure (mm Hg).
B.W. Brown and M. Hollander (1977), Statistics: A Biomedical Introduction, Wiley.
plot(bp ~ obese, pch = ifelse(bp.obese$sex == 1, "F", "M"), data = bp.obese)
summary(bp.obese)
| sex | obese | bp |
|---|---|---|
| Min. :0.0000 | Min. :0.810 | Min. : 94.0 |
| 1st Qu.:0.0000 | 1st Qu.:1.143 | 1st Qu.:116.0 |
| Median :1.0000 | Median :1.285 | Median :124.0 |
| Mean :0.5686 | Mean :1.313 | Mean :127.0 |
| 3rd Qu.:1.0000 | 3rd Qu.:1.430 | 3rd Qu.:137.5 |
| Max. :1.0000 | Max. :2.390 | Max. :208.0 |
The dataset consists of three variables: two continuous numeric ones (obese and bp) and one unordered categorical one (sex), which is coded numerically (0 = male, 1 = female) and is best converted to a factor for the analysis.
After fitting these linear models fit1 <- lm(bp ∼ obese,data=bp.obese), fit2 <- lm(bp ∼ obese+sex,data=bp.obese) and fit3 <- lm(bp ∼ obese*sex,data=bp.obese), the following outputs are obtained by the R function summary.
##
## Call:
## lm(formula = bp ~ obese, data = bp.obese)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.570 -11.241 -2.400 9.116 71.390
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 96.818 8.920 10.86 < 2e-16 ***
## obese 23.001 6.667 3.45 0.000822 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.28 on 100 degrees of freedom
## Multiple R-squared: 0.1064, Adjusted R-squared: 0.09743
## F-statistic: 11.9 on 1 and 100 DF, p-value: 0.0008222
##
## Call:
## lm(formula = bp ~ obese + sex, data = bp.obese)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.263 -11.613 -2.057 6.424 72.207
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 93.287 8.937 10.438 < 2e-16 ***
## obese 29.038 7.172 4.049 0.000102 ***
## sex -7.730 3.715 -2.081 0.040053 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17 on 99 degrees of freedom
## Multiple R-squared: 0.1438, Adjusted R-squared: 0.1265
## F-statistic: 8.314 on 2 and 99 DF, p-value: 0.0004596
##
## Call:
## lm(formula = bp ~ obese * sex, data = bp.obese)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.645 -11.621 -1.708 6.737 71.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 102.112 18.231 5.601 1.95e-07 ***
## obese 21.646 15.118 1.432 0.155
## sex -19.596 21.664 -0.905 0.368
## obese:sex 9.558 17.191 0.556 0.579
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.05 on 98 degrees of freedom
## Multiple R-squared: 0.1465, Adjusted R-squared: 0.1204
## F-statistic: 5.607 on 3 and 98 DF, p-value: 0.001368
Describe how to interpret these results, and then suggest how to proceed with further analyses.
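A natural next step is a formal comparison of the three nested models with sequential F tests via anova(). The sketch below uses simulated data with the same structure, since the real bp.obese data require the ISwR package; the coefficients are roughly matched to fit1 and the numbers are illustrative only:

```r
# simulated stand-in for bp.obese (sample size, ranges and effects roughly matched)
set.seed(1)
n <- 102
sim <- data.frame(sex = rbinom(n, 1, 0.57), obese = runif(n, 0.81, 2.39))
sim$bp <- 96.8 + 23 * sim$obese - 4 * sim$sex + rnorm(n, sd = 17)
fit1 <- lm(bp ~ obese, data = sim)
fit2 <- lm(bp ~ obese + sex, data = sim)
fit3 <- lm(bp ~ obese * sex, data = sim)
anova(fit1, fit2, fit3)  # sequential F tests for sex and for the interaction
```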
Let us consider the dataframe SAheart of the library ElemStatLearn, which comprises information about a retrospective sample of males in a heart-disease high-risk region of the Western Cape, South Africa. The help file and the output of the str command are given below
##
## Attaching package: 'ElemStatLearn'
## The following object is masked from 'package:DAAG':
##
## ozone
| SAheart | R Documentation |
A data frame with 462 observations on the following 10 variables.
sbp
systolic blood pressure
tobacco
cumulative tobacco (kg)
ldl
low density lipoprotein cholesterol
adiposity
a numeric vector
famhist
family history of heart disease, a factor with levels Absent Present
typea
type-A behavior
obesity
a numeric vector
alcohol
current alcohol consumption
age
age at onset
chd
response, coronary heart disease
## 'data.frame': 462 obs. of 10 variables:
## $ sbp : int 160 144 118 170 134 132 142 114 114 132 ...
## $ tobacco : num 12 0.01 0.08 7.5 13.6 6.2 4.05 4.08 0 0 ...
## $ ldl : num 5.73 4.41 3.48 6.41 3.5 6.47 3.38 4.59 3.83 5.8 ...
## $ adiposity: num 23.1 28.6 32.3 38 27.8 ...
## $ famhist : Factor w/ 2 levels "Absent","Present": 2 1 2 2 2 2 1 2 2 2 ...
## $ typea : int 49 55 52 51 60 62 59 62 49 69 ...
## $ obesity : num 25.3 28.9 29.1 32 26 ...
## $ alcohol : num 97.2 2.06 3.81 24.26 57.34 ...
## $ age : int 52 63 46 58 49 45 38 58 29 53 ...
## $ chd : int 1 1 0 1 1 0 0 1 0 1 ...
The aim of the study is to analyze the potential relationship between the binary response variable chd and the explanatory variables considered in the dataframe. Describe how to perform a preliminary data analysis on this dataframe, using suitable R commands, and comment the following plot.
With the command mod0 <- glm(chd ∼ ldl, data = SAheart, family = binomial), a simple logistic regression model is defined for describing the potential effect of the level of ldl on the probability of coronary heart disease. Comment the model fitting outcomes given by the function summary.
##
## Call:
## glm(formula = chd ~ ldl, family = binomial, data = SAheart)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1647 -0.8948 -0.7426 1.2688 1.8637
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.96867 0.27308 -7.209 5.63e-13 ***
## ldl 0.27466 0.05164 5.319 1.04e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 596.11 on 461 degrees of freedom
## Residual deviance: 564.28 on 460 degrees of freedom
## AIC: 568.28
##
## Number of Fisher Scoring iterations: 4
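The ldl coefficient is easier to read as an odds ratio; a quick check from the reported estimate:

```r
# odds ratio implied by the ldl coefficient in summary(mod0)
b_ldl <- 0.27466
exp(b_ldl)       # each additional unit of ldl multiplies the odds of chd by about 1.32
exp(10 * b_ldl)  # effect of a 10-unit increase in ldl
```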
After fitting these two further logistic regression models mod1 <- glm(chd ∼ ., data = SAheart, family = binomial) and mod2 <- glm(chd ∼ tobacco + ldl + famhist + typea + age + ldl:famhist, data = SAheart, family = binomial), the following outputs are obtained by the R function summary.
##
## Call:
## glm(formula = chd ~ ., family = binomial, data = SAheart)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7781 -0.8213 -0.4387 0.8889 2.5435
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.1507209 1.3082600 -4.701 2.58e-06 ***
## sbp 0.0065040 0.0057304 1.135 0.256374
## tobacco 0.0793764 0.0266028 2.984 0.002847 **
## ldl 0.1739239 0.0596617 2.915 0.003555 **
## adiposity 0.0185866 0.0292894 0.635 0.525700
## famhistPresent 0.9253704 0.2278940 4.061 4.90e-05 ***
## typea 0.0395950 0.0123202 3.214 0.001310 **
## obesity -0.0629099 0.0442477 -1.422 0.155095
## alcohol 0.0001217 0.0044832 0.027 0.978350
## age 0.0452253 0.0121298 3.728 0.000193 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 596.11 on 461 degrees of freedom
## Residual deviance: 472.14 on 452 degrees of freedom
## AIC: 492.14
##
## Number of Fisher Scoring iterations: 5
##
## Call:
## glm(formula = chd ~ tobacco + ldl + famhist + typea + age + ldl:famhist,
## family = binomial, data = SAheart)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8463 -0.7938 -0.4419 0.9161 2.4956
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.79224 0.94625 -6.121 9.28e-10 ***
## tobacco 0.08496 0.02628 3.233 0.00122 **
## ldl 0.01758 0.07302 0.241 0.80974
## famhistPresent -0.77068 0.62341 -1.236 0.21637
## typea 0.03690 0.01240 2.974 0.00294 **
## age 0.05140 0.01030 4.990 6.03e-07 ***
## ldl:famhistPresent 0.33334 0.11595 2.875 0.00404 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 596.11 on 461 degrees of freedom
## Residual deviance: 466.90 on 455 degrees of freedom
## AIC: 480.9
##
## Number of Fisher Scoring iterations: 5
Describe how to interpret these results, and then suggest how to proceed with further analyses.
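Note that mod1 and mod2 are not nested in each other (each contains terms missing from the other), so a direct anova() comparison between them is not meaningful; each can, however, be tested against the null model using the deviances reported above. A sketch for mod2:

```r
# likelihood-ratio test of mod2 against the null model, from the reported deviances
dev_null <- 596.11; df_null <- 461
dev_mod2 <- 466.90; df_mod2 <- 455
lr <- dev_null - dev_mod2                               # 129.21
pchisq(lr, df = df_null - df_mod2, lower.tail = FALSE)  # essentially zero
```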
Let us consider the dataframe house, which includes information about the price, the size, the floor, the number of bedrooms (bed) and the number of bathrooms (bath) of 546 houses. The output of the str command is given below
## 'data.frame': 546 obs. of 5 variables:
## $ price: num 42000 38500 49500 60500 61000 66000 66000 69000 83800 88500 ...
## $ size : int 5850 4000 3060 6650 6360 4160 3880 4160 4800 5500 ...
## $ bed : int 3 2 3 3 2 3 3 3 3 3 ...
## $ bath : int 1 1 1 1 1 1 2 1 1 2 ...
## $ floor: int 2 1 1 2 1 1 2 3 1 4 ...
A suitable linear regression model can be defined in order to study the potential relationship between the price, which is the response variable, and the explanatory variables considered in the dataframe. Describe how to perform a preliminary data analysis on this dataframe, using suitable R commands. Moreover, consider the following plots and discuss the possibility of measuring the variables price and size in the logarithmic scale.
After fitting the regression model fit <- lm(log(price) ∼ log(size) + bed + bath + floor, data=house), the following outputs are obtained by the R commands summary(fit) and plot(fit), respectively.
Describe how to interpret these results, and then suggest how to proceed with further analyses with particular regard to prediction.
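For prediction on the original price scale, the log-scale predictions must be back-transformed, ideally together with a prediction interval. Since house is not a standard R dataset, the sketch below uses simulated data with the same structure and plausible, invented coefficients:

```r
# simulated stand-in for the house data
set.seed(42)
n <- 546
sim <- data.frame(size = runif(n, 1500, 10000), bed = sample(2:5, n, TRUE),
                  bath = sample(1:3, n, TRUE), floor = sample(1:4, n, TRUE))
sim$price <- exp(6 + 0.6 * log(sim$size) + 0.05 * sim$bed +
                 0.15 * sim$bath + 0.08 * sim$floor + rnorm(n, sd = 0.2))
fit <- lm(log(price) ~ log(size) + bed + bath + floor, data = sim)
new <- data.frame(size = 5000, bed = 3, bath = 2, floor = 2)
exp(predict(fit, new, interval = "prediction"))  # back on the price scale
```

In a log-log model the coefficient of log(size) is an elasticity: a 1% increase in size changes the price by approximately that percentage.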
Let us consider the dataframe wages, which contains information about 3296 working individuals in the USA. The data are taken from the National Longitudinal Survey and refer to 1987. The variables are listed below, together with the output of the str command.
## 'data.frame': 3296 obs. of 5 variables:
## $ exper : int 9 12 11 9 8 9 8 10 12 7 ...
## $ male : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ school: int 13 12 11 14 14 14 12 12 10 12 ...
## $ wage : num 6.32 5.48 3.64 4.59 2.42 ...
## $ region: Factor w/ 3 levels "Center","North",..: 1 1 3 1 1 1 1 1 1 3 ...
The aim of the study is to analyze the potential relationship between the response variable wage and the explanatory variables considered in the dataframe. Describe how to perform a preliminary data analysis on this dataframe, using suitable R commands. Moreover, consider the following plots and discuss the possibility of measuring the variable wage in the logarithmic scale
In order to describe the effect of the factor male on the response log(wage), we may analyze this plot, where the probability distribution of log(wage) is represented by the kernel density estimates conditioned on the two levels (1 = male, 0 = female) of the variable male.
With the commands mod.0<-lm(log(wage) ∼ male,data=wages) and mod.1<-lm(log(wage) ∼ exper*male, data=wages), two regression models are defined for describing the potential effect of male and exper on the response log(wage). Comment the model fitting outcomes given by the function summary (Hint: consider the fact that the average years of experience in the sample is lower for women than for men).
##
## Call:
## lm(formula = log(wage) ~ male, data = wages)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0445 -0.3073 0.0544 0.3839 3.0325
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.47475 0.01559 94.59 <2e-16 ***
## male1 0.21826 0.02154 10.13 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6176 on 3294 degrees of freedom
## Multiple R-squared: 0.03023, Adjusted R-squared: 0.02994
## F-statistic: 102.7 on 1 and 3294 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = log(wage) ~ exper * male, data = wages)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0906 -0.3050 0.0560 0.3792 3.0468
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.193551 0.057470 20.768 < 2e-16 ***
## exper 0.036367 0.007156 5.082 3.94e-07 ***
## male1 0.463707 0.079062 5.865 4.93e-09 ***
## exper:male1 -0.032071 0.009518 -3.369 0.000762 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6153 on 3292 degrees of freedom
## Multiple R-squared: 0.03792, Adjusted R-squared: 0.03704
## F-statistic: 43.25 on 3 and 3292 DF, p-value: < 2.2e-16
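The interaction in mod.1 is easiest to read by writing out the two sex-specific regression lines implied by the reported estimates:

```r
# sex-specific lines for log(wage) implied by mod.1 (values from the summary above)
b0 <- 1.193551; b_exper <- 0.036367; b_male <- 0.463707; b_int <- -0.032071
slope_female <- b_exper           # ~0.0364 log-points per year of experience
slope_male   <- b_exper + b_int   # ~0.0043: almost flat for men
gap <- function(exper) b_male + b_int * exper  # male-female log-wage gap
gap(c(0, 10, 14.5))               # the gap narrows as experience increases
```

This connects with the hint: women in the sample have less experience on average, so the raw gap estimated by mod.0 (0.218) mixes the sex effect with the experience effect.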
Finally, a complete regression model is fitted mod.2 <- lm(log(wage) ∼., data=wages) and the following output is obtained by the R function summary.
##
## Call:
## lm(formula = log(wage) ~ ., data = wages)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0008 -0.2821 0.0468 0.3673 3.2337
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.279709 0.090321 -3.097 0.00197 **
## exper 0.034618 0.004549 7.610 3.55e-14 ***
## male1 0.246474 0.020607 11.961 < 2e-16 ***
## school 0.122909 0.006278 19.578 < 2e-16 ***
## regionNorth 0.051107 0.024505 2.086 0.03709 *
## regionSouth 0.047168 0.024969 1.889 0.05898 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5831 on 3290 degrees of freedom
## Multiple R-squared: 0.1364, Adjusted R-squared: 0.1351
## F-statistic: 103.9 on 5 and 3290 DF, p-value: < 2.2e-16
Describe how to interpret these results, and then suggest how to proceed with further analyses.
Let us consider the data frame loan, which contains information about 42,535 loans ranging from 1,000 $ to 35,000 $, issued by a company called Lending Club. The following variables are considered: good (the behaviour of the client, with values good and bad), fico (the FICO credit score measuring the client's creditworthiness), purpose (the intended use of the loan, with 8 different categories), loan_amnt (the credit amount in $) and income (the annual income in $ of the client). The variables are listed below, together with the output of the str command.
## 'data.frame': 42535 obs. of 5 variables:
## $ good : Factor w/ 2 levels "bad","good": 2 1 2 2 2 2 2 2 1 1 ...
## $ purpose : Factor w/ 8 levels "debt_consolidation",..: 1 4 7 6 6 8 1 4 7 6 ...
## $ fico : int 737 742 737 692 697 732 692 662 677 727 ...
## $ loan_amnt: int 5000 2500 2400 10000 3000 5000 7000 3000 5600 5375 ...
## $ income : num 24000 30000 NA 49200 80000 36000 NA 48000 40000 15000 ...
Moreover, the output of the command summary is also given
| good | purpose | fico | loan_amnt | income |
|---|---|---|---|---|
| bad : 6371 | debt_consolidation:25253 | Min. :612.0 | Min. : 500 | Min. : 4800 |
| good:36164 | other : 5160 | 1st Qu.:687.0 | 1st Qu.: 5200 | 1st Qu.: 44995 |
|  | major_purchase : 3926 | Median :712.0 | Median : 9700 | Median : 63000 |
|  | home_improvement : 3625 | Mean :715.1 | Mean :11090 | Mean : 75186 |
|  | small_business : 1992 | 3rd Qu.:742.0 | 3rd Qu.:15000 | 3rd Qu.: 90000 |
|  | vacation_wedding : 1404 | Max. :827.0 | Max. :35000 | Max. :6000000 |
|  | (Other) : 1175 |  |  | NA’s :18758 |
The aim of the study is to analyze the potential relationship between the response variable good and the explanatory variables considered in the data frame, in order to evaluate the possible good/bad behaviour of a customer. Describe how to perform a preliminary data analysis on this data frame, using suitable R commands. Moreover, consider and discuss the following plots
In order to describe the effect of the factor fico on the response good we consider a simple logistic regression model fitted using the command mod.1<-glm(good ∼ fico, data = loan, family = "binomial"). Comment the model fitting outcomes given by the function summary and the output given by the subsequent commands.
mod.1 <- glm(good ~ fico, data = loan, family = "binomial")
summary(mod.1)
##
## Call:
## glm(formula = good ~ fico, family = "binomial", data = loan)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5172 0.4078 0.5306 0.6294 0.9622
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.033068 0.296922 -23.69 <2e-16 ***
## fico 0.012358 0.000421 29.35 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 35928 on 42534 degrees of freedom
## Residual deviance: 34985 on 42533 degrees of freedom
## AIC: 34989
##
## Number of Fisher Scoring iterations: 5
exp(coef(mod.1))
## (Intercept) fico
## 0.0008822209 1.0124345145
test <- data.frame(fico=c(700,750))
test$pred <- predict(mod.1,test, type="response")
test
| fico | pred |
|---|---|
| 700 | 0.8344391 |
| 750 | 0.9033761 |
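The two reported probabilities can be reproduced by hand from the coefficients via the inverse logit:

```r
# inverse logit of the linear predictor, using the estimates from summary(mod.1)
b <- c(-7.033068, 0.012358)
plogis(b[1] + b[2] * c(700, 750))  # ~0.834 and ~0.903, matching predict()
```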
Two further logistic regression models are fitted using mod.2<-glm(good ∼ fico + loan_amnt, data = loan, family = "binomial") and mod.3<-glm(good ∼ fico + loan_amnt + income + purpose, data = loan, family = "binomial"). Comment the corresponding output obtained by the R function summary and then suggest how to proceed with a further predictive analysis.
##
## Call:
## glm(formula = good ~ fico + loan_amnt, family = "binomial", data = loan)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6261 0.4011 0.5256 0.6261 0.9423
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.367e+00 3.007e-01 -24.50 <2e-16 ***
## fico 1.319e-02 4.306e-04 30.62 <2e-16 ***
## loan_amnt -2.229e-05 1.815e-06 -12.28 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 35928 on 42534 degrees of freedom
## Residual deviance: 34838 on 42532 degrees of freedom
## AIC: 34844
##
## Number of Fisher Scoring iterations: 5
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Call:
## glm(formula = good ~ fico + loan_amnt + income + purpose, family = "binomial",
## data = loan)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3659 0.3840 0.5224 0.6320 1.2589
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.482e+00 4.119e-01 -18.165 < 2e-16 ***
## fico 1.303e-02 5.906e-04 22.061 < 2e-16 ***
## loan_amnt -3.663e-05 2.580e-06 -14.200 < 2e-16 ***
## income 7.203e-06 5.426e-07 13.275 < 2e-16 ***
## purposeeducational -5.076e-01 2.309e-01 -2.198 0.0279 *
## purposehome_improvement -1.077e-01 7.108e-02 -1.515 0.1298
## purposemajor_purchase 1.388e-02 7.689e-02 0.180 0.8568
## purposemedical -2.678e-01 1.426e-01 -1.878 0.0604 .
## purposeother -2.988e-01 6.034e-02 -4.952 7.34e-07 ***
## purposesmall_business -9.158e-01 7.016e-02 -13.052 < 2e-16 ***
## purposevacation_wedding 1.114e-01 1.126e-01 0.989 0.3226
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 20592 on 23776 degrees of freedom
## Residual deviance: 19677 on 23766 degrees of freedom
## (18758 observations deleted due to missingness)
## AIC: 19699
##
## Number of Fisher Scoring iterations: 5
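One way to proceed with a predictive analysis is to hold out a test set, refit the model on the training portion, and assess the predicted default probabilities out of sample. The sketch below illustrates the workflow on simulated stand-in data, since the loan dataframe is not included here; the variable names and coefficient values are assumptions chosen to mimic mod.2.

```r
set.seed(1)
n <- 1000
# simulated stand-ins for fico and loan_amnt (not the real loan data)
fico <- round(rnorm(n, 700, 50))
loan_amnt <- round(runif(n, 1000, 35000))
eta <- -7.4 + 0.013 * fico - 2.2e-05 * loan_amnt
good <- rbinom(n, 1, plogis(eta))
dat <- data.frame(good, fico, loan_amnt)

# hold out 30% of the observations as a test set
idx <- sample(n, 0.7 * n)
fit <- glm(good ~ fico + loan_amnt, data = dat[idx, ], family = "binomial")

# predicted probabilities on the test set and a confusion matrix at cutoff 0.5
p <- predict(fit, newdata = dat[-idx, ], type = "response")
print(table(observed = dat$good[-idx], predicted = as.integer(p > 0.5)))
accuracy <- mean((p > 0.5) == dat$good[-idx])
```

With the real data one would also compare mod.2 and mod.3 on the held-out set (e.g. via misclassification rate or an ROC curve) rather than relying on in-sample AIC alone, noting that mod.3 is fitted on fewer observations because of missing values.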
Let us consider the dataframe birthwt, which contains data on 189 births at the Baystate Medical Centre, Springfield, Massachusetts during 1986. The focus is on the variables listed below
| birthwt | R Documentation |
This data frame contains the following columns:
low
indicator of birth weight less than 2.5 kg.
age
mother’s age in years.
lwt
mother’s weight in pounds at last menstrual period.
race
mother’s race (1 = white, 2 = black, 3 = other).
smoke
smoking status during pregnancy.
ptl
number of previous premature labours.
ht
history of hypertension.
ui
presence of uterine irritability.
ftv
number of physician visits during the first trimester.
bwt
birth weight in grams.
The aim of the study is to analyze the potential relationship between the response variable bwt and the explanatory variables age and race. Describe how to perform a preliminary data analysis on this dataframe using suitable R commands and comment on the following plots.
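A minimal sketch of such a preliminary analysis, using the birthwt dataframe shipped with the MASS package:

```r
library(MASS)          # provides the birthwt dataframe

str(birthwt)           # variable types: note that race is coded numerically
summary(birthwt[, c("bwt", "age", "race")])

# mean birth weight within each race category
with(birthwt, tapply(bwt, race, mean))

# birth weight against mother's age, coloured by race
plot(bwt ~ age, data = birthwt, col = race,
     xlab = "mother's age (years)", ylab = "birth weight (g)")
```

Since race is a categorical variable coded as 1, 2, 3, one should also consider converting it with factor(birthwt$race) before modelling; the models below treat it as numeric.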
In order to describe the potential relationship between birth weight and age, taking into account also the factor race, we compare the following nested models
bwt.lm1 <- lm(bwt ~ 1 , data = birthwt)
bwt.lm2 <- lm(bwt ~ age, data = birthwt)
bwt.lm3 <- lm(bwt ~ race + age, data = birthwt)
bwt.lm4 <- lm(bwt ~ race*age, data = birthwt)
Describe the four models and comment on the results given by the Analysis of Variance Table reported below. Moreover, propose some alternative model selection procedures.
anova(bwt.lm1, bwt.lm2, bwt.lm3, bwt.lm4)
| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 188 | 99969656 | NA | NA | NA | NA |
| 187 | 99154173 | 1 | 815483.2 | 1.590830 | 0.2087958 |
| 186 | 95848562 | 1 | 3305610.3 | 6.448524 | 0.0119274 |
| 185 | 94833785 | 1 | 1014776.9 | 1.979608 | 0.1611095 |
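As an alternative to the sequence of nested F-tests, one can select among these models by an information criterion. A sketch using backward stepwise selection by AIC (via step from base R) and a direct AIC comparison of the four candidates:

```r
library(MASS)  # for the birthwt dataframe

# backward stepwise selection by AIC, starting from the largest model
full <- lm(bwt ~ race * age, data = birthwt)
bwt.sel <- step(full, direction = "backward", trace = 0)
print(formula(bwt.sel))

# AIC compared directly across the four nested candidates
print(AIC(lm(bwt ~ 1, data = birthwt),
          lm(bwt ~ age, data = birthwt),
          lm(bwt ~ race + age, data = birthwt),
          lm(bwt ~ race * age, data = birthwt)))
```

Other possibilities include BIC (step with k = log(n)) or cross-validated prediction error.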
Let us consider Model 3 and comment on the output obtained by the R functions summary and plot.
##
## Call:
## lm(formula = bwt ~ race + age, data = birthwt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2104.8 -485.7 11.3 536.0 1746.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3028.482 277.679 10.906 <2e-16 ***
## race -146.597 57.881 -2.533 0.0121 *
## age 8.039 10.032 0.801 0.4240
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 717.9 on 186 degrees of freedom
## Multiple R-squared: 0.04122, Adjusted R-squared: 0.03091
## F-statistic: 3.999 on 2 and 186 DF, p-value: 0.01994
Finally, discuss the following graphical output and then suggest how to proceed with further analyses.
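The graphical output referred to here is the standard set of regression diagnostics, which can be reproduced as follows:

```r
library(MASS)  # for the birthwt dataframe

bwt.lm3 <- lm(bwt ~ race + age, data = birthwt)

# the four standard diagnostic plots: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage
op <- par(mfrow = c(2, 2))
plot(bwt.lm3)
par(op)
```

These plots are used to check linearity, approximate normality and homoscedasticity of the residuals, and to flag influential observations; possible further steps include treating race as a factor or adding covariates such as smoke and lwt.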
Let us consider the dataframe wine, which contains information about 178 samples of wines grown in the same region in Italy. The cultivar of each wine sample is recorded (variable cultivar, with labels 1, 2, 3), together with the concentrations of 13 different chemicals (variables V1-V13). Describe how to perform a preliminary data analysis on this dataframe using suitable R commands and comment on the following outputs.
wine <- read.table("./data/wine.txt", header = TRUE)
summary(wine)
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. :1.000 | Min. :11.03 | Min. :0.740 | Min. :1.360 | Min. :10.60 | Min. : 70.00 | Min. :0.980 | Min. :0.340 | Min. :0.1300 | Min. :0.410 | Min. : 1.280 | Min. :0.4800 | Min. :1.270 | Min. : 278.0 | |
| 1st Qu.:1.000 | 1st Qu.:12.36 | 1st Qu.:1.603 | 1st Qu.:2.210 | 1st Qu.:17.20 | 1st Qu.: 88.00 | 1st Qu.:1.742 | 1st Qu.:1.205 | 1st Qu.:0.2700 | 1st Qu.:1.250 | 1st Qu.: 3.220 | 1st Qu.:0.7825 | 1st Qu.:1.938 | 1st Qu.: 500.5 | |
| Median :2.000 | Median :13.05 | Median :1.865 | Median :2.360 | Median :19.50 | Median : 98.00 | Median :2.355 | Median :2.135 | Median :0.3400 | Median :1.555 | Median : 4.690 | Median :0.9650 | Median :2.780 | Median : 673.5 | |
| Mean :1.938 | Mean :13.00 | Mean :2.336 | Mean :2.367 | Mean :19.49 | Mean : 99.74 | Mean :2.295 | Mean :2.029 | Mean :0.3619 | Mean :1.591 | Mean : 5.058 | Mean :0.9574 | Mean :2.612 | Mean : 746.9 | |
| 3rd Qu.:3.000 | 3rd Qu.:13.68 | 3rd Qu.:3.083 | 3rd Qu.:2.558 | 3rd Qu.:21.50 | 3rd Qu.:107.00 | 3rd Qu.:2.800 | 3rd Qu.:2.875 | 3rd Qu.:0.4375 | 3rd Qu.:1.950 | 3rd Qu.: 6.200 | 3rd Qu.:1.1200 | 3rd Qu.:3.170 | 3rd Qu.: 985.0 | |
| Max. :3.000 | Max. :14.83 | Max. :5.800 | Max. :3.230 | Max. :30.00 | Max. :162.00 | Max. :3.880 | Max. :5.080 | Max. :0.6600 | Max. :3.580 | Max. :13.000 | Max. :1.7100 | Max. :4.000 | Max. :1680.0 |
sapply(wine[2:14],sd)
## V2 V3 V4 V5 V6 V7
## 0.8118265 1.1171461 0.2743440 3.3395638 14.2824835 0.6258510
## V8 V9 V10 V11 V12 V13
## 0.9988587 0.1244533 0.5723589 2.3182859 0.2285716 0.7099904
## V14
## 314.9074743
Moreover, discuss the results given by the scatterplot matrix below, which considers the first 5 numerical variables, with colours indicating cultivar 1 (black), cultivar 2 (red) and cultivar 3 (blue).
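A sketch of the command producing such a scatterplot matrix. Since the wine.txt file is not available here, the code builds a stand-in dataframe `w` with the same layout (first column the cultivar label, then numeric variables); with the real data, replace `w` by `wine`.

```r
# stand-in with the same shape as wine: V1 = cultivar label, V2-V6 numeric
set.seed(42)
w <- data.frame(V1 = rep(1:3, each = 20),
                matrix(rnorm(20 * 3 * 5), ncol = 5,
                       dimnames = list(NULL, paste0("V", 2:6))))

# pairwise scatterplots of the first five chemicals, coloured by cultivar
pairs(w[, 2:6], col = c("black", "red", "blue")[w$V1])
```

Pairs of variables in which the three colours form separated clouds are the ones most useful for discriminating the cultivars.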
The aim of the study is to adequately synthesize the information given by the original variables V1-V13, capturing as much of it as possible. A further objective is to use some of these new derived variables to distinguish the three cultivars. The Principal Components Analysis procedure is applied. Present the main features of this statistical procedure, describe the arguments specified below in the function princomp and discuss the output of the function loadings.
wine.pca <- princomp(wine[2:14], cor=TRUE)
loadings(wine.pca)[,1:4]
| Comp.1 | Comp.2 | Comp.3 | Comp.4 | |
|---|---|---|---|---|
| V2 | 0.1443294 | 0.4836515 | 0.2073826 | 0.0178563 |
| V3 | -0.2451876 | 0.2249309 | -0.0890129 | -0.5368903 |
| V4 | -0.0020511 | 0.3160688 | -0.6262239 | 0.2141756 |
| V5 | -0.2393204 | -0.0105905 | -0.6120803 | -0.0608594 |
| V6 | 0.1419920 | 0.2996340 | -0.1307569 | 0.3517966 |
| V7 | 0.3946608 | 0.0650395 | -0.1461790 | -0.1980683 |
| V8 | 0.4229343 | -0.0033598 | -0.1506819 | -0.1522948 |
| V9 | -0.2985331 | 0.0287795 | -0.1703682 | 0.2033010 |
| V10 | 0.3134295 | 0.0393017 | -0.1494543 | -0.3990565 |
| V11 | -0.0886167 | 0.5299957 | 0.1373062 | -0.0659257 |
| V12 | 0.2967146 | -0.2792351 | -0.0852219 | 0.4277714 |
| V13 | 0.3761674 | -0.1644962 | -0.1660046 | -0.1841207 |
| V14 | 0.2867522 | 0.3649028 | 0.1267459 | 0.2320709 |
Moreover, discuss the following graphical outputs.
Finally, comment on this last plot, with particular concern for the aim of characterizing the three different cultivars.
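The graphical outputs in question are typically a scree plot of the component variances and a plot of the first two score vectors coloured by cultivar. A sketch is given below; since the wine dataframe is not available here, it runs on a stand-in matrix `X` with the same dimensions (with the real data, use wine.pca from above and colour by wine$V1).

```r
# stand-in for the 178 x 13 matrix of chemical measurements
set.seed(7)
X <- matrix(rnorm(178 * 13), ncol = 13)
cultivar <- rep(1:3, length.out = 178)   # stand-in class labels

pca <- princomp(X, cor = TRUE)           # PCA on the correlation matrix

screeplot(pca, type = "lines")           # variance explained per component
plot(pca$scores[, 1:2],
     col = c("black", "red", "blue")[cultivar],
     xlab = "Comp.1", ylab = "Comp.2")   # scores plot for separating cultivars
summary(pca)                             # cumulative proportion of variance
```

With the real data, the scores plot shows whether the first two principal components already separate the three cultivars, and the cumulative proportions in summary indicate how many components are needed to retain most of the total variance.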